Intro

This notebook includes the ensemble model analysis performed on the models generated by the Gitsbe module when running the DrugLogics computational pipeline for finding synergistic drug combinations (drug pairs). All these models were trained towards a specific steady state signaling pattern that was derived based on input data (gene expression, CNV) for the A498 cell line, the use of the PARADIGM software and a topology that was build for simulating a cancer cell fate decision network. The input for the simulations and the output data are in the cell-lines-2500 directory (the 2500 number denotes the number of simulations executed). The analysis will be presented step by step in the sections below.

The R version used for this analysis is:

R.version.string
## [1] "R version 3.4.4 (2018-03-15)"

Prerequisites

Firstly, we load the required libraries (you need to install them if you don’t have them):

library(rje) # version 1.9
library(igraph) # version 1.2.2, see (Csardi, 2006)
library(superheat) # version 1.0.0, see (Barter, 2017)

and the relevant helper functions:

# Set the working directory to the gitsbe-model-analysis folder: 
# setwd("pathTo/gitsbe-model-analysis")
source("Rscripts/input_functions.R")
source("Rscripts/output_functions.R")
source("Rscripts/analysis_functions.R")
source("Rscripts/plot_functions.R")

Input

We will define the name of the cell line which must match the name of the directory that has the input files inside the cell-lines-2500 directory. Our analysis in this notebook will be done on the data for the A498 cell line:

cell.line = "A498"
data.dir = paste0(getwd(), "/cell-lines-2500/", cell.line, "/")

Three inputs are used in this analysis:

model.predictions.file = paste0(data.dir, "model_predictions")
observed.synergies.file = paste0(data.dir, "observed_synergies")
models.dir = paste0(data.dir, "models")

Now, we parse the data into proper R objects. First the synergy predictions per model:

model.predictions = get.model.predictions(model.predictions.file)
head(model.predictions)

So, we can see that our dataset has the models as rows and each column is a different drug combination that was tested in our simulations.

drug.combinations.tested = colnames(model.predictions)
models                   = rownames(model.predictions)
nodes                    = get.node.names(models.dir)

number.of.drug.comb.tested = length(drug.combinations.tested)
number.of.models           = length(models)
number.of.nodes            = length(nodes)

print.model.and.drug.stats(number.of.drug.comb.tested, number.of.models, 
                           number.of.nodes)
## [1] "Drug combinations tested: 153 Number of models: 7500 Number of nodes: 139"

Next, we get the full stable state and the equations per model:

models.stable.state = get.stable.state.from.models.dir(models.dir)
models.stable.state = models.stable.state[models,]
head(as.data.frame(models.stable.state))

The rows of the above dataset represent the models while the columns are the names of the nodes (proteins, genes, etc.) of the cancer cell network under study. So, each model has one stable state which means that in every model, the nodes in the network have reached a state of either 0 (inhibition) or 1 (activation).

models.equations = get.equations.from.models.dir(
  models.dir, remove.equations.without.link.operator = TRUE)
models.equations = models.equations[models,]
head(as.data.frame(models.equations))

For the equations, if we look at a specific row (a model so to speak), the columns (node names) correspond to the targets of regulation (and the network has been built so that every node is a target - i.e. it has other nodes activating and/or inhibiting it). The general form of a boolean equation is: Target *= (Activator OR Activator OR…) AND NOT (Inhibitor OR Inhibitor OR…).

The difference between the models’ boolean equations is the link operator (OR NOT/AND NOT) which has been mutated (changed) through the evolutionary process of the genetic algorithm in Gitsbe. For example, if a model has for the column ERK_f (of the models.equations object) a value of 1, the correspoding equation is: ERK_f *= (MEK_f) OR NOT ((DUSP6) OR PPP1CA). A value of 0 would correspond to the same equation but having AND NOT as the link operator: RK_f *= (MEK_f) AND NOT ((DUSP6) OR PPP1CA). Note that the equations that do not have link operators (meaning that they are the same for every model) are discarded (so less columns in this dataset) since in a later section we study only the equations whose link operators differentiate between the models.

Lastly, the synergies observed for this particular cell line are:

observed.synergies = get.observed.synergies(
  observed.synergies.file, drug.combinations.tested)
number.of.observed.synergies = length(observed.synergies)

print(paste("Number of synergies observed:", number.of.observed.synergies))
## [1] "Number of synergies observed: 17"
observed.synergies
##  [1] "AK-60" "AK-BI" "AK-D1" "PI-D1" "PD-G2" "AK-G4" "D1-G4" "PI-JN"
##  [9] "BI-P5" "PD-P5" "PI-P5" "AK-PD" "BI-PD" "AK-PI" "BI-PI" "PD-PI"
## [17] "PK-ST"

Analysis

Performance Statistics

It will be interesting to know the percentage of the above observed synergies that were actually predicted by at least one of the models (there might be combinations that no model in our dataset could predict):

# Subset the model.data to the observed synergies
observed.model.predictions = 
  model.predictions[,sapply(drug.combinations.tested, function(drug.comb) {
    is.correct.synergy(drug.comb, observed.synergies)
  })]

# Subset the model.data to the unobserved synergies
unobserved.model.predictions = 
  model.predictions[,sapply(drug.combinations.tested, function(drug.comb) {
    !is.correct.synergy(drug.comb, observed.synergies)
  })]

stopifnot(dim(observed.model.predictions)[2] + 
          dim(unobserved.model.predictions)[2] == number.of.drug.comb.tested)

number.of.models.per.observed.synergy = 
  apply(observed.model.predictions, 2, sum, na.rm = T)
predicted.synergies = names(number.of.models.per.observed.synergy)[
                            number.of.models.per.observed.synergy > 0]

number.of.models.per.observed.synergy
## AK-BI AK-PD AK-PI AK-D1 AK-60 AK-G4 BI-PD BI-PI BI-P5 PD-PI PD-G2 PD-P5 
##     0     3     0     0     0     0    51   125     0   110     0     0 
## PI-JN PI-D1 PI-P5 PK-ST D1-G4 
##     0  3980     0     0     0
predicted.synergies
## [1] "AK-PD" "BI-PD" "BI-PI" "PD-PI" "PI-D1"
predicted.synergies.percentage = 100 * length(predicted.synergies) /
                    number.of.observed.synergies
print(paste0("Percentage of True Positive predicted synergies: ", 
             specify.decimal(predicted.synergies.percentage, 2), "%"))
## [1] "Percentage of True Positive predicted synergies: 29.41%"

So, for this particular cell line, there were indeed observed synergies that no model could predict (e.g. AK-BI, AK-PI). Next, we would like to know the maximum number of observed synergies predicted by one model alone - can one model by itself predict all the true positive synergies predicted by all the models together or do we need many models to capture this diverse synergy landscape? To do that, we go even further and count the number of models that predict a specific set of observed synergies for every possible combination subset of the predicted.synergies object:

# Find the number of predictive models for every synergy subset
predicted.synergies.powerset = powerSet(predicted.synergies)
predicted.synergies.powerset = predicted.synergies.powerset[
  order(sapply(predicted.synergies.powerset, length))
]
names(predicted.synergies.powerset) = 
  sapply(predicted.synergies.powerset, function(drug.comb.set) {
    paste(drug.comb.set, collapse = ",")
  }) 
synergy.subset.stats = 
  sapply(predicted.synergies.powerset, function(drug.comb.set) {
    count.models.that.predict.synergy.set(drug.comb.set, observed.model.predictions) 
})

# Bar plot of the number of models for every possible observed synergy combination set
# Tweak the threshold.for.subset.removal and bottom.margin as desired
make.barplot.on.synergy.subset.stats(synergy.subset.stats,
                                     threshold.for.subset.removal = 1, 
                                     bottom.margin = 9, cell.line)

From the above figure (where we excluded sets of synergies that were predicted by no model by setting the threshold.for.subset.removal value to 1) we observe that:

  • Almost half of the models predict no synergies
  • The PI-D1 synergy is predicted by almost all the rest of the models.

Next we calculate the maximum number of correctly predicted synergies (\(TP\) - True Positives) per model:

# Count the predictions of the observed synergies per model (TP)
models.synergies.tp = apply(observed.model.predictions, 1, sum, na.rm = T)
models.synergies.tp.stats = table(models.synergies.tp)

# Bar plot of number of models vs correctly predicted synergies
make.barplot.on.models.stats(models.synergies.tp.stats, cell.line, 
                            title = "True Positive Synergy Predictions",
                            xlab = "Number of maximum correctly predicted synergies",
                            ylab = "Number of models")

To summarize:

  • There were only 2 models that predicted 3 synergies - the set BI-PD,PD-PI,PI-D1 - which is the maximum number of predicted synergies by an individual model
  • No model could predict all 5 of the total predicted synergies.

The power of the ensemble model approach lies in the fact that (as we saw from the above figures) even though we may not have individual super models that can predict many observed drug combinations, there are many that predict at least one and which will be used by the drug response analysis module (Drabme) to better infer the synergistic drug combinations. It goes without saying though, that the existance of models that could predict more than a handful of synergies would be beneficial for any approach that performs drug response analysis on a multitude of models.

Biomarker analysis

Intro-Methods

Now, we want to investigate and find possible important nodes - biomarkers - whose activity state either distinguishes good performance models from less performant ones (in terms of a performance metric - e.g. the true positive synergies predicted) or makes some models predict a specific synergy compared to others that can’t (but could predict other synergies). So, we devised two strategies to split the models in our disposal to good and bad ones (but not necessarily all of them), the demarcation line being either a performance metric (the number of \(TP\) or the Matthews Correlation Coefficient score) or the prediction or not of a specific synergy. Then, for each group of models (labeled as either good or bad) we find the average activity state of every node in the network (value between 0 and 1) and then we compute the average state difference for each node between the two groups: \(\forall i\in nodes,mean(state_i)_{good} - mean(state_i)_{bad}\). Our hypothesis is that if the absolute value of these average differences are larger than a user-defined threshold (e.g. 0.7) for a small number of nodes (the less the better) while for the rest of the nodes they remain close to zero, then the former nodes are considered the most important since they define the difference between the average bad model and the average good one in that particular case study. We will also deploy a network visualization method to observe these average differences.

True Positives-based analysis

Using our first strategy, we will split the models based on the number of true positive predictions. For example, the bad models will be the ones that predicted 0 \(TP\) synergies whereas the good models will be the ones that predicted 2 \(TP\) (we will denote the grouping as \((0,2)\)). This particular classification strategy will be used for every possible combination of the number of \(TP\) as given by the models.synergies.tp.stats object and the density estimation of the nodes’ average state differences in each case will be ploted in a common graph:

tp.values = as.numeric(names(models.synergies.tp.stats))
tp.values.comb = t(combn(tp.values, 2))

diff.tp.results = apply(tp.values.comb, 1, function(comb) {
  return(get.avg.activity.diff.based.on.tp.predictions(
    models, models.synergies.tp, models.stable.state, 
    num.low = comb[1], num.high = comb[2]))
})

tp.comb.names = apply(tp.values.comb, 1, function(row) {
  return(paste0("(", paste(row, collapse = ","), ")")) 
})
colnames(diff.tp.results) = tp.comb.names
diff.tp.results = t(diff.tp.results)

tp.densities = apply(abs(diff.tp.results), 1, density)
make.multiple.density.plot(tp.densities, legend.title = "True Positives", 
                           x.axis.label = "Activity state (absolute difference value)",
                           title = "Density Estimation for the Average State Difference")

What we are actually looking for is density plots that are largely skewed to the right (so the average absolute differences of these nodes are close to zero) while there are a few areas of non-zero densities which are as close to 1 as possible. So, from the above graph, the density plots that fit this description are the ones marked as \((1,3)\) and \((2,3)\). We will visualize the nodes’ average state differences in a network graph, where the color of each node will denote how much more inhibited or active that node is, in the average good model vs the average bad one. The color of the edges will denote activation (green) or inhibition (red). We first build the network from the node topology (edge list):

parent.dir = get.parent.dir(data.dir)
topology.file = paste0(parent.dir, "/topology")
coordinates.file = paste0(parent.dir, "/network_xy_coordinates")

net = contruct.network(topology.file, models.dir)

# a static layout for plotting the same network always (igraph)
# nice.layout = layout_nicely(net)
nice.layout = as.matrix(read.table(coordinates.file))

In the next colored graphs we can identify the important nodes whose activity state can influence the true positive prediction performance (from 0 true positive synergies to a total of 3):

# useful tutorial about network visualization in R: (Ognyanova, 2018)

diff.tp.0.3 = diff.tp.results[3,]
plot.network(net, diff.tp.0.3, layout = nice.layout,
             title = "Bad models (0 TP) vs Good models (3 TP)")

diff.tp.1.3 = diff.tp.results[5,]
plot.network(net, diff.tp.1.3, layout = nice.layout,
             title = "Bad models (1 TP) vs Good models (3 TP)")

diff.tp.2.3 = diff.tp.results[6,]
plot.network(net, diff.tp.2.3, layout = nice.layout,
             title = "Bad models (2 TP) vs Good models (3 TP)")

We will now list the important nodes that affect the models’ prediction performance with regards to the number of true positive synergies that they predict. Comparing the graphs above, we observe that there exist common nodes that maintain the same significant influence in all of the graphs. Setting the threshold for the absolute significance level in average state differences to \(0.7\) and taking the middle case scenario as the representative one (1 \(TP\) vs 3 \(TP\)) we find the nodes that have to be in a more active state:

threshold = 0.7
biomarkers.tp.active = diff.tp.1.3[diff.tp.1.3 > threshold]
pretty.print.vector.names(biomarkers.tp.active)
## [1] "13 nodes: PSEN1,CEBPA,MAPK8IP1,MAPK9,JAK1,TYK2,JAK3,IFNGR2/INFGR1,IFNGR1,PTPN11,IFNGR2,IL2RB,IL10RA"

Also, the nodes that have to be in a more inhibited state are:

biomarkers.tp.inhibited = diff.tp.1.3[diff.tp.1.3 < -threshold]
pretty.print.vector.names(biomarkers.tp.inhibited)
## [1] "17 nodes: MAP3K7,MAP2K6,MAP2K3,NLK,IKBKG,STAT3,RXRA,SOCS3,TGFB1,HSPA1A,SALL4,ROCK1,TGFBR1,TRAF6,RHOA,PIK3R1,CASP9"

MCC classification-based analysis

The previous method to split the models based on the number of true positive predictions is a good metric for absolute performance but a very restricted one since it ignores the other values of the confusion matrix for each model (true negatives, false positives, false negatives). Also, since our dataset is imbalanced in the sense that out of the total drug combinations tested only a few of them are observed as synergistic (and in a hypothetical larger drug screening evaluation it will be even less true positives) we will now devise a method to split the models into different performance categories based on the value of the Matthews Correlation Coefficient (MCC) score which takes into account the balance ratios of all the four confusion matrix values:

# Count the false negatives (FN)
models.synergies.fn = apply(observed.model.predictions, 1, function(x) {
  sum( x == 0 | is.na(x) )
})

# P = TP + FN (Positives)
positives = ncol(observed.model.predictions)
models.synergies.p = models.synergies.tp + models.synergies.fn

# Count the predictions of the non-observed synergies per model (FP)
models.synergies.fp = apply(unobserved.model.predictions, 1, sum, na.rm = T)

# Count the True Negatives (TN)
models.synergies.tn = apply(unobserved.model.predictions, 1, function(x) {
  sum( x == 0 | is.na(x))
})

# N = FP + TN (Negatives)
negatives = ncol(unobserved.model.predictions)
models.synergies.n = models.synergies.fp + models.synergies.tn

# checks
stopifnot(models.synergies.p == positives)
stopifnot(models.synergies.n == negatives)
stopifnot(positives + negatives == number.of.drug.comb.tested)

# Calculate Matthews Correlation Coefficient (MCC)
models.mcc = calculate.mcc(models.synergies.tp, models.synergies.tn, 
                           models.synergies.fp, models.synergies.fn,
                           positives, negatives)
models.mcc.stats = table(models.mcc, useNA = "ifany")

make.barplot.on.models.stats(models.mcc.stats, cell.line, title = "MCC scores", 
                             xlab = "MCC value", ylab = "Number of models", 
                             cont.values = TRUE)

From the above figure we observe that:

  • There are no relatively bad models (MCC values close to -1)
  • Most of the models perform a little better than random prediction (\(MCC>0\)).
  • There are models that had NaN value for the MCC score.

Given the MCC formula: \(MCC = (TP\cdot TN - FP\cdot FN)/\sqrt{(TP+FP) * (TP+FN) * (TN+FP) * (TN+FN)}\), we can see that the MCC value can be NaN because of zero devision. Two of the four values in the denominator represent the number of positive \((TP+FN)\) and negative \((TN+FP)\) observations which are non-zero for every model, since they correspond to the observed and non-obsered synergies in each case. The case where both \(TN\) and \(FN\) are zero is rare (if non-existent) because of the imbalanced dataset (large proportion of negatives) and the reason that logical models which report no negatives means that they should find fixpoint attractors for every possible drug combination perturbation which also is extremely unlikely. We can actually see that the NaN are produced by models that have both TP and FP equal to zero:

sum(models.synergies.tp + models.synergies.fp == 0)
## [1] 2074

Since these models could intentify no synergies (either correctly or wrongly), we decided to put them as the lowest performant category in our MCC-based analysis. To classify the models based on the MCC score (which takes values in the \([-1, 1]\) interval), we devised a method that splits the previously found MCC values range to intervals of a specific size (each corresponding to a different classification category), so that each model’s MCC score falls within a specific interval:

mcc.values = as.numeric(names(models.mcc.stats))

# split into classes with a 0.2 MCC value range each
mcc.intervals = get.mcc.intervals(mcc.values, interval.size = 0.2)

# add NaN category (if applicable)
if (sum(is.na(mcc.values)) > 0) {
  mcc.intervals = rbind(c(NaN, NaN), mcc.intervals)
}

mcc.classes = get.mcc.classes(mcc.intervals)
print.mcc.classification.info(mcc.classes)
## [1] "MCC values are split into 4 classes:"
## [1] "1. NaN"
## [1] "2. [-0.2, 0)"
## [1] "3. [0, 0.2)"
## [1] "4. [0.2, 0.4]"

Following our first strategy, we will split the models based on the MCC performance metric score. For example, the bad models will be the ones that had a NaN MCC score \((TP+FP = 0)\) whereas the good models will be the ones that had an MCC score between \([-0.2, 0)\). This particular classification strategy will be used for every possible combination of the MCC classes as defined by the mcc.classes object and the density estimation of the nodes’ average state differences in each case will be ploted in a common graph:

mcc.class.id.comb = t(combn(1:length(mcc.classes), 2))

diff.mcc.results = apply(mcc.class.id.comb, 1, function(comb) {
  return(get.avg.activity.diff.based.on.mcc.classification(
    models, models.mcc, mcc.intervals, models.stable.state, 
    class.id.low = comb[1], class.id.high = comb[2]))
})

mcc.classes.comb.names = apply(mcc.class.id.comb, 1, function(comb) {
  return(paste0("(", mcc.classes[comb[1]], ", ", mcc.classes[comb[2]], ")"))
})

colnames(diff.mcc.results) = mcc.classes.comb.names
diff.mcc.results = t(diff.mcc.results)

mcc.densities = apply(abs(diff.mcc.results), 1, density)
make.multiple.density.plot(mcc.densities, legend.title = "MCC classes",
                           x.axis.label = "Activity state (absolute difference value)",
                           title = "Density Estimation for the Average State Difference")

From the above graph we notice that the NaN MCC class has almost the same density distribution of the absolute average state differences when compared to all the other MCC classes, meaning that this truly represents a (bad) class of models which are fundamentally different from all other models that have at least one \(TP\) or \(FP\) prediction. The density plots of interest here are the ones that compare the 2nd vs 4th class \(([-0.2, 0), [0.2, 0.4])\) and the 3rd vs 4th class \(([0, 0.2), [0.2, 0.4])\). Next, we visualize the average state differences with our network coloring method, in order to identify the important nodes whose activity state can influence the prediction performance based on the MCC classification:

# ([-0.2, 0), [0.2, 0.4])
diff.mcc.2.4 = diff.mcc.results[5,]
plot.network(net, diff.mcc.2.4, layout = nice.layout, 
             title = "Bad models, MCC: [-0.2, 0) vs Good models, MCC: [0.2, 0.4]")

# ([0, 0.2), [0.2, 0.4])
diff.mcc.3.4 = diff.mcc.results[6,]
plot.network(net, diff.mcc.3.4, layout = nice.layout, 
             title = "Bad models, MCC: [0, 0.2) vs Good models, MCC: [0.2, 0.4]")

We will now list the important nodes that affect the models’ prediction performance with regards to their MCC classification, using the average state difference results from comparing the 2nd to the 4th MCC class. First, the nodes that have to be in a more active state:

biomarkers.mcc.active = diff.mcc.2.4[diff.mcc.2.4 > threshold]
pretty.print.vector.names(biomarkers.mcc.active)
## [1] "11 nodes: PTEN,JAK1,TYK2,JAK3,IFNGR2/INFGR1,IFNGR1,PTPN11,IFNGR2,IL2RB,IL10RA,PPM1A"

Secondly, the nodes that have to be in a more inhibited state:

biomarkers.mcc.inhibited = diff.mcc.2.4[diff.mcc.2.4 < -threshold]
pretty.print.vector.names(biomarkers.mcc.inhibited)
## [1] "18 nodes: MAP3K7,MAP2K6,MAP2K3,NLK,IKBKG,CREB1,STAT3,PtsIns(3,4,5)P3,PTK2,SOCS3,TGFB1,HSPA1A,SALL4,ROCK1,TGFBR1,TRAF6,RHOA,PIK3R1"

Equation-based analysis

It will be interesting to see the different patterns in the form of the boolean equations (regarding the mutation of the link operator as mentioned in the Input section) when comparing higher performance models vs the low performant ones. We could also check if any of the biomarkers found above relate to a different link operator on average between models with different performance characteristics (e.g. higher predictive models should have the OR NOT as the link operator in a boolean equation where a specific biomarker is the regulation target) or if they constitute targets of exclusively activating nodes or inhibiting ones (an equation with no link operator).

The performance metric we will first use to sort the models is the number of true positive predictions:

number.of.unique.tp.values = length(tp.values)
rbPal = colorRampPalette(c("red", "black", "green"))
tp.colors = rbPal(number.of.unique.tp.values)
models.synergies.tp.colors = tp.colors[
  as.numeric(cut(models.synergies.tp, breaks = number.of.unique.tp.values))
]
names(models.synergies.tp.colors) = models

title = "Color Palette"
xlab  = "Number of True Positives"
make.color.bar.plot(color.vector = tp.colors, 
                    number.vector = tp.values, 
                    title, xlab)

So, we will now illustrate the heatmap of the models.equations object raw-order by the number of \(TP\) predictions:

# order based on number of true positives
models.synergies.tp.sorted = sort(models.synergies.tp)
models.sorted = names(models.synergies.tp.sorted)

models.equations.sorted = models.equations[models.sorted, ]
models.synergies.tp.colors.sorted = models.synergies.tp.colors[models.sorted]

# color biomarkers names in the heatmap
bottom.nodes.colors = rep("black", length(colnames(models.equations.sorted)))
names(bottom.nodes.colors) = colnames(models.equations.sorted)
bottom.nodes.colors[names(bottom.nodes.colors) %in% 
                    common.biomarkers.active] = "green"
bottom.nodes.colors[names(bottom.nodes.colors) %in% 
                    common.biomarkers.inhibited] = "red"

superheat(models.equations.sorted, title.alignment = "center",
          title = "Models link operators (AND/OR NOT)", 
          row.title = "Models (TP sorted)", smooth.heat = TRUE,
          row.dendrogram = FALSE, col.dendrogram = TRUE, scale = FALSE,
          clustering.method = "hierarchical", dist.method = "euclidean",
          force.left.label = TRUE, left.label.text.size = 0, 
          left.label.size = 0.05, 
          left.label.col = models.synergies.tp.colors.sorted,
          heat.pal = c("red","gold"),
          legend.breaks = c(0, 1), 
          grid.hline = FALSE, grid.vline = FALSE,
          bottom.label.text.size = 3, bottom.label.text.angle = 90, 
          bottom.label.col = "white", bottom.label.text.col = bottom.nodes.colors)

In the heatmap above, an equation whose link operator is AND NOT is represented with red color while an OR NOT link operator is represented with gold color. The targets whose equations do not have a link operator are not represented. The rows are ordered by the number of true positive predictions (ascending). We have colored the names of the network nodes that were also found as biomarkers (green color is used for the active biomarkers and red color for the inhibited biomarkers). We observe that:

  • Most of the biomarkers are nodes that do not have both activators and inhibitors and so are absent from the above heatmap
  • There doesn’t seem to exist a pattern between the models’ link operators and their corresponding performance (at least not for all of the nodes) when using the true positive predictions as a classifier for the models
  • There exist a lot of target nodes that need to have the OR NOT link operator in their respective boolean equation in order for the corresponding logical model to show a higher number of true positive predictions. By assigning the OR NOT link operator to a target’s boolean regulation equation, we allow more flexibility to the target’s output active result state - meaning that the inhibitors play less role and the output state has a higher probability of being active - compared to assigning the AND NOT link operator to the equation.

We will also illustrate the heatmap of the models.equations object raw-order by the MCC score which is a better performance classifier. Models who had a NaN MCC score will be again placed in the lower performant category:

# order based on the MCC value
models.mcc.sorted = sort(models.mcc, na.last = FALSE)
models.sorted = names(models.mcc.sorted)

models.equations.sorted = models.equations[models.sorted, ]

superheat(models.equations.sorted, title.alignment = "center",
          title = "Models link operators (AND/OR NOT)", 
          row.title = "Models (MCC sorted)", smooth.heat = TRUE,
          row.dendrogram = FALSE, col.dendrogram = TRUE, scale = FALSE,
          clustering.method = "hierarchical", dist.method = "euclidean",
          yr = models.mcc.sorted, yr.plot.type = "line", yr.axis.name = "MCC score",
          yr.line.col = "cornflowerblue", yr.plot.size = 0.2,
          heat.pal = c("red","gold"),
          legend.breaks = c(0, 1), 
          grid.hline = FALSE, grid.vline = FALSE,
          bottom.label.text.size = 3, bottom.label.text.angle = 90, 
          bottom.label.col = "white", bottom.label.text.col = bottom.nodes.colors)

In the MCC-ordered heatmap above we observe that there is better correlation between the boolean equations’ link operators and the performance of a model compared to the \(TP\) classification. For example, the biomarkers STAT3 and JAK1 show distinguished patterns for the higher performance models (use of AND NOT and OR NOT link operators respectively) when the models are sorted by the MCC score.

Synergy-prediction based analysis

We will now use the second strategy to split the models, based on whether they predict a specific observed synergy or not. This will allow us to find biomarkers that affect the prediction of a specific synergy. For example, the good models could be the ones that predicted the hypothetical synergy A-B while the bad models all the rest that identified the particular combination as non-synergistic. In another case scenario, the good models could be those that predicted a triple synergy set A-B,C-D,A-C, while the bad models could be the ones that predicted the double synergy subset A-B,C-D (excluding the common models that predicted both the triple synergy set and subsequently its given subset). In such a case scenario, we want to find out which nodes are responsible for making the good models predict the extra synergy - in this hypothetical case the synergy A-C - demonstrating thus better model performance. Note that the models selected in each case as good or bad, could have predicted other synergies as well (correctly as \(TP\) or wrongly as \(FP\)) which means that the biomarker selection method could be somewhat innacurate, since we can’t really know the prediction of which extra synergy or synergies the biomarkers’ state affected. To account for this, we label as good models those that predict large synergy sets (so fewer models) which capture almost all the true positive predictions and also minimize the possible extra different synergies predicted by models of the same classification category (e.g. the good models).

Starting with the first model classification method (prediction vs non-prediction of a particular synergy), we generate the density distribution of the nodes’ average state differences between the good and bad models for each predicted synergy:

diff.predicted.synergies.results = 
  sapply(predicted.synergies, function(drug.comb) {
    get.avg.activity.diff.based.on.specific.synergy.prediction(
      model.predictions, models.stable.state, drug.comb)
  })
diff.predicted.synergies.results = t(diff.predicted.synergies.results)

densities = apply(abs(diff.predicted.synergies.results), 1, density)
make.multiple.density.plot(densities, legend.title = "Predicted Synergies",
                           x.axis.label = "Activity state (absolute difference value)",
                           title = "Density Estimation for the Average State Difference")

Next, we will plot the biomarkers with the network visualization method (for all predicted synergies) and output the respective biomarkers in each case. The biomarker results for each predicted synergy will be stored for further comparison with the results from the other cell lines.

for (drug.comb in predicted.synergies) {
  title.text = paste0("Prediction of ", drug.comb, 
                      " synergy: Good models vs Bad Models")
  diff = diff.predicted.synergies.results[drug.comb, ]
  plot.network(net, diff, layout = nice.layout, title = title.text)
  biomarkers.active = diff[diff > threshold]
  biomarkers.inhibited = diff[diff < -threshold]
  print.biomarkers.for.specific.synergy(drug.comb, biomarkers.active, 
                                                   biomarkers.inhibited)
  save.vector.to.file(vector = biomarkers.active, file = paste0(
    biomarkers.dir, drug.comb, "_biomarkers_active"), with.row.names = TRUE)
  save.vector.to.file(vector = biomarkers.inhibited, file = paste0(
    biomarkers.dir, drug.comb, "_biomarkers_inhibited"), with.row.names = TRUE)
}

## [1] "Biomarkers for AK-PD synergy prediction"
## 
## [1] "Active biomarkers"
## [1] "17 nodes: MAP3K11,FGFR1,TAB1,PTPN7,MAX,PIK3CA,MAPK8IP1,MAPK9,JAK1,TYK2,JAK3,IFNGR2/INFGR1,IFNGR1,PTPN11,IFNGR2,IL2RB,IL10RA"
## 
## [1] "Inhibited biomarkers"
## [1] "34 nodes: DLX5,NR3C1,MAPK14,RPS6KA5,STAT3,DUSP1,MAP2K2,PI3K,PIK3CG,RXRA,SOCS3,TGFB1,HSPA1A,SALL4,RPS6K,PDPK1,RPS6KB1,PLK1,RPS6KA3,PRKCD,PRKCZ,PKN1,PRKCA,SGK3,PRKCB,PRKCG,PAK1,ROCK1,TGFBR1,TRAF6,RHOA,PIK3R1,FOXO3,CASP9"

## [1] "Biomarkers for BI-PD synergy prediction"
## 
## [1] "Active biomarkers"
## [1] "9 nodes: JAK1,TYK2,JAK3,IFNGR2/INFGR1,IFNGR1,PTPN11,IFNGR2,IL2RB,IL10RA"
## 
## [1] "Inhibited biomarkers"
## [1] "15 nodes: MAP3K7,MAP2K6,MAP2K3,NLK,IKBKG,STAT3,SOCS3,TGFB1,HSPA1A,SALL4,ROCK1,TGFBR1,TRAF6,RHOA,PIK3R1"

## [1] "Biomarkers for BI-PI synergy prediction"
## 
## [1] "Active biomarkers"
## [1] "0 nodes: "
## 
## [1] "Inhibited biomarkers"
## [1] "0 nodes: "

## [1] "Biomarkers for PD-PI synergy prediction"
## 
## [1] "Active biomarkers"
## [1] "0 nodes: "
## 
## [1] "Inhibited biomarkers"
## [1] "1 node: CASP9"

## [1] "Biomarkers for PI-D1 synergy prediction"
## 
## [1] "Active biomarkers"
## [1] "2 nodes: PTEN,PPM1A"
## 
## [1] "Inhibited biomarkers"
## [1] "3 nodes: CREB1,PtsIns(3,4,5)P3,PTK2"

We notice that for some predicted synergies the above method identified zero biomarkers. We will now study cases where the main goal is the better identification and/or refinement of the biomarkers responsible for allowing the models to predict one extra synergy from a specific synergy set. We will focus on the predicted synergies PD-PI, BI-PD and BI-PI.

Synergy-set prediction based analysis

PD-PI synergy

The first use case will contrast the models that predicted the synergy set BI-PD,PD-PI,PI-D1 vs the models that predicted the double synergy subset BI-PD,PI-D1. This will allow us to find important nodes whose state can influence the prediction performance of a model and make it predict the extra PD-PI synergy:

synergy.set.str    = "BI-PD,PD-PI,PI-D1"
synergy.subset.str = "BI-PD,PI-D1"
diff.PD.PI = get.avg.activity.diff.based.on.diff.synergy.set.prediction(
  synergy.set.str, synergy.subset.str, model.predictions, models.stable.state)

We now visualize the average state differences:

title.text = paste0("Good models (", synergy.set.str,
                    ") vs Bad Models (", synergy.subset.str, ")")
plot.network(net, diff.PD.PI, layout = nice.layout, title = title.text)

and report the active biomarkers found:

biomarkers.PD.PI.active = diff.PD.PI[diff.PD.PI > threshold]
pretty.print.vector.names(biomarkers.PD.PI.active)
## [1] "2 nodes: PSEN1,CEBPA"

Note that with the previous method of classifying the models (those that predict the PD-PI synergy and those that don’t), we couldn’t identify the 2 biomarkers that need to be in an active state for the prediction of this particular synergy. The inhibited biomarker is the same one as before:

biomarkers.PD.PI.inhibited = diff.PD.PI[diff.PD.PI < -threshold]
pretty.print.vector.names(biomarkers.PD.PI.inhibited)
## [1] "1 node: CASP9"

Lastly, we add the newly found biomarkers in the respective files:

drug.comb = "PD-PI"
update.biomarkers.files(biomarkers.dir, drug.comb, 
                        biomarkers.PD.PI.active, biomarkers.PD.PI.inhibited)
BI-PD synergy

The second use case will contrast the models that predicted the synergy set BI-PD,PD-PI,PI-D1 vs the models that predicted the double synergy subset PD-PI,PI-D1. This will allow us to find important nodes whose state can influence the prediction performance of a model and make it predict the extra BI-PD synergy:

synergy.set.str    = "BI-PD,PD-PI,PI-D1"
synergy.subset.str = "PD-PI,PI-D1"
diff.BI.PD = get.avg.activity.diff.based.on.diff.synergy.set.prediction(
  synergy.set.str, synergy.subset.str, model.predictions, models.stable.state)

We now visualize the average state differences:

title.text = paste0("Good models (", synergy.set.str,
                    ") vs Bad Models (", synergy.subset.str, ")")
plot.network(net, diff.BI.PD, layout = nice.layout, title = title.text)

and report the active biomarkers found:

biomarkers.BI.PD.active = diff.BI.PD[diff.BI.PD > threshold]
pretty.print.vector.names(biomarkers.BI.PD.active)
## [1] "4 nodes: PSEN1,CEBPA,MAPK8IP1,MAPK9"

Note that these biomarkers are completely different from the ones found with the previous method. Same is true for the inhibited biomarkers:

biomarkers.BI.PD.inhibited = diff.BI.PD[diff.BI.PD < -threshold]
pretty.print.vector.names(biomarkers.BI.PD.inhibited)
## [1] "1 node: RXRA"

Lastly, we add the newly found biomarkers in the respective files:

drug.comb = "BI-PD"
update.biomarkers.files(biomarkers.dir, drug.comb, 
                        biomarkers.BI.PD.active, biomarkers.BI.PD.inhibited)
BI-PI synergy

The third use case will contrast the models that predicted the synergy set BI-PI,PI-D1 vs the models that predicted the single synergy PI-D1. This will allow us to find important nodes whose state can influence the prediction performance of a model and make it predict the extra BI-PI synergy:

synergy.set.str = "BI-PI,PI-D1"
synergy.subset.str = "PI-D1"
diff.BI.PI = get.avg.activity.diff.based.on.diff.synergy.set.prediction(
  synergy.set.str, synergy.subset.str, model.predictions, models.stable.state)

We now visualize the average state differences:

title.text = paste0("Good models (", synergy.set.str,
                    ") vs Bad Models (", synergy.subset.str, ")")
plot.network(net, diff.BI.PI, layout = nice.layout, title = title.text)

and report the active biomarkers found:

biomarkers.BI.PI.active = diff.BI.PI[diff.BI.PI > threshold]
pretty.print.vector.names(biomarkers.BI.PI.active)
## [1] "2 nodes: MAPK8IP1,MAPK9"

We also report the inhibited biomarkers:

biomarkers.BI.PI.inhibited = diff.BI.PI[diff.BI.PI < -threshold]
pretty.print.vector.names(biomarkers.BI.PI.inhibited)
## [1] "1 node: RXRA"

Lastly, we add the newly found biomarkers in the respective files:

drug.comb = "BI-PI"
update.biomarkers.files(biomarkers.dir, drug.comb, 
                        biomarkers.BI.PI.active, biomarkers.BI.PI.inhibited)

Biomarker results

In this section, we will compare the biomarkers found per predicted synergy for this particular cell line as well as the performance biomarkers (notated as PERF in the heatmap below) which were found using the common results from the TP and MCC classification-based model analysis:

# Biomarkers from sections:
# `Synergy-prediction based analysis`, `Synergy-set prediction based analysis`
biomarkers.synergy.res =
  get.biomarkers.per.synergy(predicted.synergies, biomarkers.dir, models.dir)

# store biomarkers in one file
save.df.to.file(biomarkers.synergy.res, file =
                paste0(biomarkers.dir, "biomarkers_per_synergy"))

# Biomarkers from section:
# `Common performance-related biomarkers`
biomarkers.res = add.performance.biomarkers(biomarkers.synergy.res, 
                                            common.biomarkers.active, 
                                            common.biomarkers.inhibited)

# prune nodes which are not found as biomarkers for any predicted synergy or
# for better model performance
biomarkers.res = biomarkers.res[, colSums(biomarkers.res != 0) > 0]

superheat(biomarkers.res, title.alignment = "center",
          title = paste0("Biomarker results (", cell.line, ")"),
          row.dendrogram = FALSE, col.dendrogram = TRUE, scale = FALSE,
          column.title = "Network Nodes", row.title = "Predicted synergies",
          heat.pal = c("tomato", "grey", "gold"),
          # grey   = not a biomarker
          # tomato = inhibited biomarker
          # gold   = active biomarker
          force.grid.vline = TRUE, force.grid.hline = TRUE,
          grid.vline.size = 0.1, grid.hline.size = 0.1, 
          legend.breaks = c(-1, 0, 1),
          left.label.col = "white", left.label.text.size = 4,
          bottom.label.col = "white",
          bottom.label.text.size = 2, bottom.label.text.angle = 90)

So, in general we observe that:

  • A total of \(63\) nodes were found as biomarkers (for at least one synergy)
  • We found a lot of biomarkers for some synergies, but very few for some others. Usually we wouldn’t expect too many biomarkers that are directly related to the prediction of a specific synergy. The abudance of (false positive) biomarkers for some synergies (e.g. AK-PD) relates to the model classification method used, which does not incorporate in its internal logic that the prediction of other synergies than the ones used for the grouping itself can affect the biomarker results obtained from it
  • All the performance-related biomarkers (PERF) were also observed as biomarkers for the prediction of a specific synergy(ies)
  • There exist common biomarkers across different predicted synergies, e.g. RXRA is a common inhibited biomarker across 3 synergistic drug combinations
  • The results between the different synergies correlate in the sense that there is no active biomarker for a particular synergy that was found as inhibited in another and vise versa
  • The biomarkers of the PI-D1 synergy are not shared with any of the other predicted synergies’ biomarkers
  • The synergies BI-PD and AK-PD share many common biomarkers as well as with the performance-related biomarkers

TO BE REMOVED

Comparison of network visualization libraries: igraph, visNetwork, threejs

# igraph
plot.network(net, diff.mcc.3.4, layout = nice.layout,
             title = "Bad models, MCC: [0, 0.2) vs Good models, MCC: [0.2, 0.4]")

library("visNetwork") # version 2.0.5
plot.network.vis(net, diff.mcc.3.4, layout = nice.layout,
             title = "Bad models, MCC: [0, 0.2) vs Good models, MCC: [0.2, 0.4]")
library("threejs") # version 0.3.1
plot.network.3d(net, diff.mcc.3.4)

References

  1. Csardi G, Nepusz T (2006) The igraph software package for complex network research, InterJournal, Complex Systems 1695. http://igraph.org
  2. Ognyanova, K. (2018) Network visualization with R. Retrieved from www.kateto.net/network-visualization
  3. Barter Rebecca, Yu Bin (2017) Superheat: An R package for creating beautiful and extendable heatmaps for visualizing complex data. https://arxiv.org/abs/1512.01524, https://rlbarter.github.io/superheat/